We will, again, use data from Gapminder for the following exercises on relational data.

There is an R package named gapminder that contains a selection of the data from Gapminder.

Our aim for this exercises is to use individual data files from Gapminder to create combined datasets that resemble the one included in the gapminder package.

Before we can begin with these exercises, we need to load the datasets. To do this, you can simply use the following code (that repeats some parts of the previous exercises).

library(tidyverse)

gap_cont <- read_csv("../data/countries_continent.csv")

gap_life <- read_csv("../data/life_expectancy_years.csv")

gap_pop <- read_csv("../data/population_total.csv")

gap_gdp <- read_csv("../data/gdppercapita_us_inflation_adjusted.csv")

gap_fert <- read_csv("../data/children_per_woman_total_fertility.csv")

gap_cont <- gap_cont %>% 
  mutate(continent = as_factor(continent))

First of all, let’s have a look at the Gapminder data that is included in the gapminder package.

1

Install and load the gapminder package and have a look at the dataset it contains. The dataset we want is simply called gapminder.
In order to print the dataset, you can just type its name and run that line.
if (!require(gapminder)) install.packages("gapminder")
## Loading required package: gapminder
# check if the gapminder package is installed and install it, if that is not the case

library(gapminder)

gapminder
## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows

In the following, we will use different joins to create datasets that contain the same set of variables. We will create two versions of the combined dataset.

Before we do this, however, we want to explore the overlap and discrepancies between the individual datasets. This is somewhat easier to do with the datasets in wide format (as each country name only appears in one row in those).

2

Using a filtering join, determine which countries are in gap_pop but not in gap_life and vice versa. Do the same for gap_gdp and gap_life. Do not assign the results to objects, just print them to the console.
You should use anti_join() for this task. To just get the country names, you can select the country variable.
gap_pop %>% 
  anti_join(gap_life, by = "country") %>% 
  select(country)
## # A tibble: 8 x 1
##   country            
##   <chr>              
## 1 Holy See           
## 2 Liechtenstein      
## 3 Monaco             
## 4 Nauru              
## 5 Palau              
## 6 San Marino         
## 7 St. Kitts and Nevis
## 8 Tuvalu
gap_life %>% 
  anti_join(gap_pop, by = "country") %>% 
  select(country)
## # A tibble: 0 x 1
## # ... with 1 variable: country <chr>
gap_gdp %>% 
  anti_join(gap_life, by = "country") %>% 
  select(country)
## # A tibble: 7 x 1
##   country            
##   <chr>              
## 1 Liechtenstein      
## 2 Monaco             
## 3 Nauru              
## 4 Palau              
## 5 San Marino         
## 6 St. Kitts and Nevis
## 7 Tuvalu
gap_life %>% 
  anti_join(gap_gdp, by = "country") %>% 
  select(country)
## # A tibble: 3 x 1
##   country    
##   <chr>      
## 1 North Korea
## 2 Somalia    
## 3 Syria

For the following series of joins, we want the data in the long format again. As we have not really discussed this topic (only briefly mentioned it), just copy, paste, and run the code below to transform the datasets accordingly.

gap_life <- gap_life %>% 
  pivot_longer(-country,
               names_to = "year",
               values_to = "life_exp") %>% 
  mutate(year = as.numeric(year))

gap_pop <- gap_pop %>% 
  pivot_longer(-country,
               names_to = "year",
               values_to = "pop") %>% 
  mutate(year = as.numeric(year))

gap_gdp <- gap_gdp %>% 
  pivot_longer(-country,
               names_to = "year",
               values_to = "gdp_percap") %>% 
  mutate(year = as.numeric(year))

gap_fert <- gap_fert %>% 
  pivot_longer(-country,
               names_to = "year",
               values_to = "fert") %>% 
  mutate(year = as.numeric(year))

As stated above, we want to create two different versions of the combined datasets: One without missing data and one with as many observations (rows) as possible.

3

Create the two versions of the combined dataset described above using one type of mutating join for each one.

For the one without missing data you should start with the dataset with the largest number of countries in it (gap_pop), and then join the other datasets in descending order of the number of countries they contain (gap_life, gap_gdp, gap_cont). This datasets also requires two additional (pipe) steps to ensure that a) it contains no NAs and b) its variables are in the same order as in the gapminder package dataset.

For the dataset with the maximum amount of observations (rows), it does not matter which dataset you start from. Name the resulting datasets gap_nomiss and gap_full.

The types of mutating joins that you should use are inner_join() and full_join. The variables you should join on are country and year.

Remember that the gap_nomiss dataset requires two additional steps.
gap_full <- gap_cont %>% 
  full_join(gap_life, by = "country") %>% 
  full_join(gap_pop, by = c("country", "year")) %>% 
  full_join(gap_gdp, by = c("country", "year"))

gap_nomiss <- gap_pop %>% 
  inner_join(gap_life, by = c("country", "year")) %>% 
  inner_join(gap_gdp, by = c("country", "year")) %>%
  inner_join(gap_cont, by = "country") %>% 
  drop_na() %>% 
  select(country, continent, year, life_exp, pop, gdp_percap)

4

As a final task for this exercise on relational data, check how the two datasets you have just created differ. They should have the same number of columns/variables, but different numbers of rows.
dim(gap_full)
## [1] 58703     6
dim(gap_nomiss)
## [1] 6644    6